We propose a computationally intensive method, the random
lasso method, for variable selection in linear models. The method con- 
sists of two major steps. In step 1, the lasso method is applied to many 
bootstrap samples, each using a set of randomly selected covariates. 
A measure of importance is yielded from this step for each covariate. 
In step 2, a similar procedure to the first step is implemented with 
the exception that for each bootstrap sample, a subset of covariates is 
randomly selected with unequal selection probabilities determined by 
the covariates' importance. Adaptive lasso may be used in the second 
step with weights determined by the importance measures. The final 
set of covariates and their coefficients are determined by averaging 
bootstrap results obtained from step 2. The proposed method allevi- 
ates some of the limitations of lasso, elastic-net and related methods 
noted especially in the context of microarray data analysis: it tends to 
remove highly correlated variables altogether or select them all, and 
maintains maximal flexibility in estimating their coefficients, particu- 
larly with different signs; the number of selected variables is no longer 
limited by the sample size; and the resulting prediction accuracy is 
competitive or superior compared to the alternatives. We illustrate 
the proposed method by extensive simulation studies. The proposed 
method is also applied to a Glioblastoma microarray data analysis. 



1. Introduction. Suppose the training data set consists of n observations
$(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)$, where
$x_i = (x_{i1}, \ldots, x_{ip})'$ is a p-dimensional vector of predictors and
$y_i$ is the response variable. We consider the following linear model in
this article:

(1.1) $y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$,

where $\varepsilon_i$ is the error term with mean zero. We assume that the
response and the predictors are mean-corrected, so we can exclude the
intercept term from model (1.1).

Our motivating application comes from the area of microarray data anal- 
ysis [Horvath et al. (2006)], which embodies some of the properties of the 
model (1.1) in many modern applications: 

1. In a typical microarray study, the sample size n is usually on the or- 
der of 10s, while the number of genes p is on the order of 1000s or even 
10,000s. For example, in the glioblastoma microarray gene expression study 
of Horvath et al. (2006), the sample sizes of the two data sets are 55 and 65, 
respectively, while the number of genes considered in their analysis is 3600. 

2. Microarray data analysis typically combines predictive performance 
and model interpretation as its goals: one seeks models which explain the 
phenotype of interest well, but also identify genes, pathways, etc. that might 
be involved in generating this phenotype. 

Shrinkage in general, and variable selection in particular, feature promi- 
nently in such applications. Significantly decreasing the number of variables 
used in the model from the original 1000's to a more manageable number 
by identifying the most useful and predictive ones usually facilitates both 
improved accuracy and interpretation. 

Variable selection has been studied extensively in the literature; see Breiman 
(1995), Tibshirani (1996), Fan and Li (2001), Zou and Hastie (2005) and 
Zou (2006), among many others. In particular, the lasso method proposed 
by Tibshirani (1996) has gained much attention in recent years. 

The lasso criterion penalizes the $L_1$-norm of the regression coefficients:

$$\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where $\lambda$ is a nonnegative tuning parameter. Owing to the singularity of
the derivative of the $L_1$-norm penalty at $\beta_j = 0$, lasso continuously
shrinks the estimated coefficients toward zero, and some estimated
coefficients will be exactly zero when $\lambda$ is sufficiently large.
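To see why a large enough $\lambda$ produces exact zeros, consider the special case of an orthonormal design ($X'X = I$). There the lasso criterion separates by coordinate, and each lasso coefficient is the OLS coefficient soft-thresholded at $\lambda/2$. The sketch below assumes that special case; the function names and the numbers are illustrative only.

```python
def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_orthonormal(beta_ols, lam):
    """Lasso solution when X'X = I (assumed), for the criterion
    sum_i (y_i - x_i'b)^2 + lam * sum_j |b_j|."""
    return [soft_threshold(b, lam / 2.0) for b in beta_ols]


# Hypothetical OLS coefficients, for illustration only.
beta_ols = [3.0, -0.4, 0.1, -2.0]
print(lasso_orthonormal(beta_ols, lam=1.0))   # [2.5, 0.0, 0.0, -1.5]
print(lasso_orthonormal(beta_ols, lam=10.0))  # [0.0, 0.0, 0.0, 0.0]
```

Coefficients whose OLS value is smaller than the threshold in absolute value are set exactly to zero, and the rest are pulled toward zero by the same amount.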

Although lasso has shown success in many situations, it has two limita- 
tions in practice [Zou and Hastie (2005)]: 

1. When the model includes several highly correlated variables, all of 
which are related to some extent to the response variable, lasso tends to 
pick only one or a few of them and shrinks the rest to 0. This may not be 
a desirable feature. For example, in microarray analysis, expression levels of 
genes that share one common biological pathway are usually highly
correlated, and these genes may all contribute to the biological process,
but lasso usually selects only one gene from the group. An ideal method should be
able to select all relevant genes, highly correlated or not, while eliminating 
trivial genes. 

2. When p > n, lasso can identify at most n variables before it saturates.
This again may not be a desirable feature for many practical problems,
particularly microarray studies, for it is unlikely that only such a small num- 
ber of genes are involved in the development of a complex disease. A method 
that is able to identify more than n variables should be more desirable for 
such problems. 

Several methods have been proposed recently to alleviate the two possible
limitations of lasso mentioned above, including the elastic-net [Zou and
Hastie (2005)], the adaptive lasso [Zou (2006)], the relaxed lasso
[Meinshausen (2007)] and VISA [Radchenko and James (2008)]. In particular,
Zou and Hastie (2005) proposed the elastic-net method, a penalized
regression with a mixture of the $L_1$-norm and the $L_2$-norm penalties of
the coefficients:

$$\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2,$$

where $\lambda_1$ and $\lambda_2$ are two nonnegative tuning parameters.
Similar to lasso, the elastic-net method also simultaneously does automatic
variable selection and continuous shrinkage. Due to the nature of the
$L_2$-norm penalty, that is, the ridge regression penalty, the number of
selected variables is no longer limited by the sample size. However, the
ridge penalty forces the estimated coefficients of highly correlated
predictors to be close to each other. This feature helps to select or remove
highly correlated variables altogether if their coefficients are truly close
to each other, but it loses the ability to estimate coefficients of highly
correlated variables with different magnitudes, particularly with different
signs, which is not rare in practical problems. As a simple illustrative
example, eggs are rich in both protein and cholesterol, which have quite
different effects on human health. When we consider the impact of egg
consumption on human health, we have two highly correlated variables with
opposite effects. In this scenario, forcing the estimated coefficients of
protein and cholesterol to be the same will cause large biases, and is not
expected to yield adequate prediction performance.
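The pull of the ridge penalty toward equal coefficients can be seen in the extreme case of two identical predictors, a simplifying assumption made here only for illustration. If $x_{i1} = x_{i2}$ for all $i$, the fit depends on the two coefficients only through $c = \beta_1 + \beta_2$. For coefficients of the same sign the $L_1$ penalty equals $\lambda_1 |c|$ however $c$ is split, and splits with opposite signs only increase both penalties, so the split is decided by the $L_2$ penalty alone:

```latex
\min_{\beta_1 + \beta_2 = c} \lambda_2 \bigl( \beta_1^2 + \beta_2^2 \bigr)
  = \min_{\beta_1} \lambda_2 \bigl( \beta_1^2 + (c - \beta_1)^2 \bigr),
\qquad
\frac{d}{d\beta_1} \bigl( \beta_1^2 + (c - \beta_1)^2 \bigr) = 4\beta_1 - 2c = 0
\;\Longrightarrow\;
\hat\beta_1 = \hat\beta_2 = \frac{c}{2}.
```

Whenever $\lambda_2 > 0$ the two estimates therefore coincide, whatever the underlying effects are. With highly but not perfectly correlated predictors the same pull toward equality remains, which is the source of the bias in the egg example.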

Another modification of the lasso method is the adaptive lasso proposed
by Zou (2006), which penalizes the weighted $L_1$-norm of the regression
coefficients:

(1.3) $\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|$,

where $w_j = |\hat\beta_j^{ols}|^{-r}$ for a constant $r > 0$, and
$\hat\beta_j^{ols}$ is the classical ordinary least squares (OLS) estimator
for $\beta_j$. Adaptive lasso possesses some nice asymptotic properties that
lasso does not have. When p is fixed, n tends to $\infty$ and $\lambda$
approaches zero with a certain rate, Zou (2006) has shown that the adaptive
lasso approach selects the true underlying model with probability tending to
one, and the corresponding estimated coefficients have the same asymptotic
normal distribution as they would have if the true underlying model were
provided in advance. This is called the "oracle" property by Fan and Li
(2001), a property of super-efficiency. Although adaptive lasso has nice
asymptotic properties, its finite sample performance does not always
dominate lasso because it heavily depends on the precision of the OLS
estimation. In his Table 2, Zou [(2006), page 1424] showed that the
prediction performance of adaptive lasso can be worse than that of lasso
when the OLS estimation is more variable. In practice, adaptive lasso
suffers (sometimes more severely than lasso) from the multicollinearity
caused by large correlations among covariates because OLS estimates are very
unstable in this situation. In addition, due to the intrinsic constraint of
the $L_1$-norm penalty, the number of variables selected by adaptive lasso
cannot exceed n.
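A small sketch of how OLS-based weights change the shrinkage, again under the simplifying assumption of an orthonormal design, where each adaptive lasso coefficient is the OLS coefficient soft-thresholded at $\lambda w_j / 2$. The helper names and numbers are hypothetical; this is not code from Zou (2006).

```python
def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def adaptive_lasso_orthonormal(beta_ols, lam, r=1.0):
    """Adaptive lasso when X'X = I (assumed), with weights w_j = |beta_ols_j|^(-r)."""
    fit = []
    for b in beta_ols:
        if b == 0.0:
            fit.append(0.0)  # infinite weight: the coefficient stays at zero
            continue
        w = abs(b) ** (-r)
        fit.append(soft_threshold(b, lam * w / 2.0))
    return fit


# Hypothetical OLS coefficients, for illustration only.
beta_ols = [3.0, -0.4, 0.1, -2.0]
fit = adaptive_lasso_orthonormal(beta_ols, lam=1.0, r=1.0)
print([round(b, 3) for b in fit])  # [2.833, 0.0, 0.0, -1.75]
```

Large OLS coefficients receive small thresholds and are shrunk only slightly, while small ones are removed. Because the thresholds are built from the OLS estimates, any instability in those estimates carries over directly to the selection.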

In this article we propose a novel extension of the lasso method, which 
we call the random lasso method. The proposed method can handle highly 
correlated variables in a more flexible manner than elastic-net, especially 
when their effects have different magnitudes and signs, and also can select 
more variables than the sample size. Our experiments below demonstrate
that the combination of variable selection quality, estimation accuracy, and
prediction quality offered by the random lasso is consistently competitive
with, and often significantly superior to, that of all the approaches
mentioned above. The main price one pays for using the random lasso,
however, is significantly increased computational complexity.

The rest of the paper is organized as follows. We introduce the proposed 
random lasso method in Section 2, and demonstrate the method using sim- 
ulation studies in Section 3. In Section 4 we analyze a real data example, 
and in Section 5 we provide a summary of the proposed method. 

2. Random lasso. As mentioned above, one of the limitations of lasso is 
that it can select only one or a few of a set of highly correlated important 
variables. If several independent data sets were generated from the same 
distribution, then we would expect lasso to select nonidentical subsets of 
those highly correlated important variables from different data sets, and our 
final collection may be most, or perhaps even all, of those highly correlated 
important variables by taking a union of selected variables from different 
data sets. Such a process may yield more than n variables, overcoming the 
other limitation of lasso.

In practice, however, we have only a single data set at hand, and splitting
the available data set into small pieces is not an efficient way of using data.
The bootstrap may yield desirable perturbations similar to that of multiple
data sets. Because each bootstrap sample may include only a subset of the
highly correlated variables, the bootstrap has the ability to break down
some of the correlations. Hence, for each bootstrap sample, we can randomly
select q candidate variables, with q < p. This becomes the basic idea of the
proposed random lasso approach that has a similar flavor to the random 
forest method; see Breiman (2001). We also acknowledge that Park and 
Hastie (2008) proposed using bootstrap to provide a measure of how likely 
the predictors were to be selected and examine what other predictors could 
have been preferred. An obvious idea is to build on Park and Hastie's idea to 
construct a complete predictive modeling tool which may be termed "Bagged 
Lasso." Our algorithm may be considered a more evolved and "adaptive" 
version of this idea. In the experiments below we discuss the effects of this 
added complexity on performance. 

Our proposed algorithm is a two-step approach and is described below.
In each step, bootstrap samples are drawn to yield the desired perturbation
similar to that of multiple data sets. To give the method the most flexibility,
we allow different numbers of randomly selected variables to be included
in the model, that is, $q_1$ candidate variables are randomly selected in each
bootstrap sample of the first step, and $q_2$ candidate variables are randomly
selected in each bootstrap sample of the second step, where $q_1$ and $q_2$ are
treated as two tuning parameters that can be chosen as large as p.

Algorithm ("Generate" and "Select"). Step 1. Generating importance
measures for all coefficients:

1a. Draw B bootstrap samples with size n by sampling with replacement
from the original training data set.

1b. For the $b_1$th bootstrap sample, $b_1 \in \{1, \ldots, B\}$, randomly
select $q_1$ candidate variables, and apply lasso to obtain estimators
$\hat\beta_j^{(b_1)}$ for $\beta_j$, $j = 1, \ldots, p$. Estimators are zero
for coefficients of those unselected variables, either outside the subset of
$q_1$ variables, or excluded by lasso.

1c. Compute the importance measure of $x_j$ by
$I_j = \bigl| B^{-1} \sum_{b_1=1}^{B} \hat\beta_j^{(b_1)} \bigr|$.

Step 2. Selecting variables.

2a. Draw another set of B bootstrap samples with size n by sampling
with replacement from the original training data set.

2b. For the $b_2$th bootstrap sample, $b_2 \in \{1, \ldots, B\}$, randomly
select $q_2$ candidate variables with selection probability of $x_j$
proportional to its importance $I_j$ obtained in step 1c, and apply lasso (or
adaptive lasso) to obtain estimators $\hat\beta_j^{(b_2)}$ for $\beta_j$,
$j = 1, \ldots, p$. Estimators are zero for coefficients of those unselected
variables, either outside the subset of $q_2$ variables, or excluded by lasso.

2c. Compute the final estimator $\hat\beta_j$ of $\beta_j$ by
$\hat\beta_j = B^{-1} \sum_{b_2=1}^{B} \hat\beta_j^{(b_2)}$.
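The two steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lasso fits use a simple coordinate descent routine, $\lambda$ is held fixed rather than tuned, bootstrap samples are not re-centered, only plain lasso (not adaptive lasso) is used in step 2, and the weighted draw of candidate variables is done sequentially without replacement. All function names and the synthetic data are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for sum_i (y_i - x_i'b)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - Xb; equal to y while b = 0
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # all-zero column: leave its coefficient at 0
            # x_j' (partial residual that excludes variable j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def weighted_subset(weights, k, rng):
    """Draw k distinct indices; each draw is proportional to the remaining weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        w = [weights[j] for j in remaining]
        if sum(w) > 0.0:
            pick = rng.choices(remaining, weights=w, k=1)[0]
        else:  # assumption: uniform draws once no positive weight is left
            pick = rng.choice(remaining)
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


def bootstrap_average(X, y, B, q, lam, weights, rng):
    """Average lasso fits over B bootstrap samples with q candidate variables each."""
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(B):
        rows = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        cols = weighted_subset(weights, q, rng)      # candidate variables
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        for j, b in zip(cols, lasso_cd(Xb, yb, lam)):
            total[j] += b  # variables not drawn contribute an estimate of 0
    return [t / B for t in total]


def random_lasso(X, y, B, q1, q2, lam, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    # Step 1: equal selection probabilities; importance I_j = |average estimate|
    step1 = bootstrap_average(X, y, B, q1, lam, [1.0] * p, rng)
    importance = [abs(a) for a in step1]
    # Step 2: selection probabilities proportional to I_j; the average is the estimate
    beta_hat = bootstrap_average(X, y, B, q2, lam, importance, rng)
    return importance, beta_hat


# Synthetic data for illustration: 6 independent predictors, 2 with nonzero effects.
data_rng = random.Random(1)
true_beta = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0]
X = [[data_rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(60)]
y = [sum(b * x for b, x in zip(true_beta, row)) + data_rng.gauss(0.0, 0.5) for row in X]
importance, beta_hat = random_lasso(X, y, B=30, q1=3, q2=3, lam=1.0)
print([round(b, 2) for b in beta_hat])
```

Because an estimate of zero is recorded whenever a variable is not among the candidates, the averaged coefficients are shrunk in proportion to how often each variable is drawn. In step 2 the important variables are drawn more often, so their averages are shrunk less.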

In step lc, we would like to generate an importance score for each pre- 
dictor to assist variable selection and coefficient estimation in the second 
step. The average coefficient for each predictor over all bootstrap samples 
is our choice as an importance score. The intuition is that, for an unimpor- 
tant variable, the estimated coefficients in different bootstrap samples are 
likely to be small or even have different signs, so the corresponding average 
will typically be close to zero. For an important variable, however, the esti- 
mated coefficients in different bootstrap samples are likely to be consistently 
large, and the corresponding average is also large. Therefore, we choose the 
absolute value of the average as the importance score for each predictor. 
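A toy calculation illustrates why the absolute value is taken after averaging rather than before. The bootstrap estimates below are made up for illustration, and `mean_abs` is an alternative shown only for contrast.

```python
def importance_score(estimates):
    """Absolute value of the average estimate across bootstrap samples (step 1c)."""
    return abs(sum(estimates) / len(estimates))


def mean_abs(estimates):
    """For contrast only: average of the absolute values."""
    return sum(abs(b) for b in estimates) / len(estimates)


# Hypothetical step-1 estimates of one coefficient over B = 5 bootstrap samples.
unimportant = [0.3, -0.2, 0.0, 0.1, -0.25]
important = [2.1, 0.0, 1.8, 2.4, 0.0]

print(round(importance_score(unimportant), 3), round(mean_abs(unimportant), 3))  # 0.01 0.17
print(round(importance_score(important), 3), round(mean_abs(important), 3))      # 1.26 1.26
```

Estimates with inconsistent signs cancel in the average, so the unimportant variable gets a score near zero. Taking absolute values first would hide that inconsistency.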

In step 2b, there are several choices of weights if adaptive lasso is applied,
for example,

$w_j = 1/|\hat\beta_j^{ols}|^r$, $w_j = 1/|\hat\beta_j^{ridge}|^r$ or $w_j = 1/|\hat\beta_j^{uni}|^r$,

where $\hat\beta_j^{ols}$ is the OLS estimator (if p < n), $\hat\beta_j^{ridge}$
is the ridge regression estimator, $\hat\beta_j^{uni}$ is the univariate
estimator, and r is a positive number. Instead, we use importance measures
obtained in step 1 as the weights for adaptive lasso in our numerical
examples and find it works well.

In practice, we need to choose the number of bootstrap samples B, the
numbers of candidate variables $q_1$ and $q_2$ to be included in each
bootstrap sample, and the tuning parameter $\lambda$ for the (adaptive) lasso
applied to each bootstrap sample. Based on our experience, the results of our
algorithm change little once B is large. One can take B = 500 or B = 1000,
for example. We can use cross-validation (CV) to select $q_1$ and $q_2$, and
either CV or generalized cross-validation (GCV) to select $\lambda$. In the
following simulations, we use independent validation data sets.

3. Simulation studies. In this section we use simulations to demonstrate
the proposed random lasso method, and compare it to a large collection of
other methods. Data are generated from model (1.1) with $x_{ij} \sim N(0, 1)$
and $\varepsilon_i \sim N(0, \sigma^2)$.

Five examples are considered. Examples 1 and 2 were used in the lasso 
paper by Tibshirani (1996), the adaptive lasso paper by Zou (2006), and the 
elastic-net paper by Zou and Hastie (2005). In Examples 3 and 4, the coeffi- 
cients of some highly correlated variables have different signs. In Example 5 
the number of variables with nonzero coefficients is larger than the sample 
size. The following are the details of the five examples.

Example 1. There are p = 8 variables. The pairwise correlation between
$x_{j_1}$ and $x_{j_2}$ is set to be $\rho(j_1, j_2) = 0.5^{|j_1 - j_2|}$. We let

$\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)$.

Following Zou (2006), we consider three values of $\sigma$: $\sigma \in \{1, 3, 6\}$.
The corresponding signal-to-noise ratios (SNR) are 21.3, 2.4 and 0.6,
respectively, where the SNR is defined as $\mathrm{Var}(X'\beta)/\mathrm{Var}(\varepsilon)$.

Example 2. We use the same model as in Example 1 but with $\beta_j = 0.85$
for all j. We also consider the same three values of $\sigma$ as in Example 1.
The corresponding signal-to-noise ratios (SNR) are 14.5, 1.6 and 0.4, respectively.

Example 3. There are p = 40 variables. The first 10 coefficients are
nonzero. The correlation between each pair of the first 10 variables is set
to be 0.9. The remaining 30 variables are independent of each other, and
also independent of the first 10 variables. We let

$\beta = (3, 3, 3, 3, 3, -2, -2, -2, -2, -2, 0, \ldots, 0)$,

and $\sigma = 3$. The SNR is about 3.2.

Example 4. There are p = 40 variables. The first six coefficients are
nonzero. The pairwise correlation between the first three variables is set to
be 0.9, and the same correlation structure is also set for the second three
variables. The remaining 34 variables are independent of each other. The
first three variables, the second three variables, and the remaining 34
variables are independent of each other. We let

$\beta = (3, 3, -2, 3, 3, -2, 0, \ldots, 0)$,

and $\sigma = 6$. The SNR is about 0.9.
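The signal-to-noise ratios quoted for Examples 1-4 can be checked directly from the definition, since $\mathrm{Var}(X'\beta) = \beta' \Sigma \beta$ for predictors with covariance matrix $\Sigma$. The short sketch below rebuilds the four correlation structures; the helper names are hypothetical.

```python
def ar1_corr(p, rho):
    """Correlation matrix with entries rho^|i - j| (Examples 1 and 2)."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def block_corr(p, blocks, rho):
    """Identity matrix, except that variables in the same block have correlation rho."""
    sigma = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for block in blocks:
        for i in block:
            for j in block:
                if i != j:
                    sigma[i][j] = rho
    return sigma


def snr(beta, sigma, noise_sd):
    """Var(x'beta) / Var(eps) = beta' Sigma beta / noise_sd^2."""
    p = len(beta)
    signal = sum(beta[i] * sigma[i][j] * beta[j] for i in range(p) for j in range(p))
    return signal / noise_sd ** 2


beta1 = [3, 1.5, 0, 0, 2, 0, 0, 0]
beta2 = [0.85] * 8
beta3 = [3] * 5 + [-2] * 5 + [0] * 30
beta4 = [3, 3, -2, 3, 3, -2] + [0] * 34

print([round(snr(beta1, ar1_corr(8, 0.5), s), 2) for s in (1, 3, 6)])  # [21.25, 2.36, 0.59]
print([round(snr(beta2, ar1_corr(8, 0.5), s), 2) for s in (1, 3, 6)])  # [14.46, 1.61, 0.4]
print(round(snr(beta3, block_corr(40, [range(10)], 0.9), 3), 2))       # 3.22
print(round(snr(beta4, block_corr(40, [range(3), range(3, 6)], 0.9), 6), 2))  # 0.92
```

These values agree with the reported SNRs after rounding to one decimal place.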

Example 5. There are p = 120 variables. The first 60 coefficients are
nonzero and drawn from N(3, 0.5), and their values are then fixed for all
simulation runs. The remaining 60 coefficients are set to be zero. The covariate
matrix is generated from a multivariate normal distribution with zero mean
and covariance matrix

$$\Sigma = \begin{pmatrix} \Sigma_0 & & & \\ & \Sigma_0 & 0.2J & \\ & 0.2J & \Sigma_0 & \\ & & & \Sigma_0 \end{pmatrix},$$

where $\Sigma_0$ is a 30 x 30 matrix with unit diagonal elements and off-diagonal
elements of value 0.7, and J is a 30 x 30 matrix with all unit elements.

For Examples 1-4, we consider two sample sizes: n = 50 and n = 100.
For Example 5, since the purpose is to study the performance of methods
when p > n, we consider only sample size n = 50. For each example, we also
generate a validation data set with the same sample size as the training
data set. Models are fitted on training data only, and the validation data
are used to select, for each method, the tuning parameters that minimize its
prediction error. Regarding the number of bootstrap samples, we used
B = 200. We also tried B = 500; the results are similar to those of B = 200.

We calculate the relative model error (RME) given below to evaluate the
prediction performance of each predictive model. Suppose that the fitted
coefficient vector is $\hat\beta$ and the true coefficient vector is $\beta^0$,
then the relative model error is defined as follows:

$$\mathrm{RME} = (\hat\beta - \beta^0)' \Sigma (\hat\beta - \beta^0) / \sigma^2,$$

where $\Sigma$ is the covariance matrix of the predictors, and $\sigma$ is the
standard deviation of the error term in model (1.1).
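As a worked instance of the definition, the sketch below evaluates the RME for a made-up two-predictor case. The function name and all numbers are hypothetical.

```python
def relative_model_error(beta_hat, beta_true, sigma, noise_sd):
    """(beta_hat - beta_true)' Sigma (beta_hat - beta_true) / noise_sd^2."""
    d = [bh - bt for bh, bt in zip(beta_hat, beta_true)]
    p = len(d)
    quad = sum(d[i] * sigma[i][j] * d[j] for i in range(p) for j in range(p))
    return quad / noise_sd ** 2


# Hypothetical example: two predictors with correlation 0.5, noise SD 0.5.
sigma = [[1.0, 0.5], [0.5, 1.0]]
print(relative_model_error([2.5, 2.0], [3.0, 1.5], sigma, noise_sd=0.5))  # 1.0
```

The two estimation errors here have opposite signs, and because the predictors are positively correlated they partly cancel in prediction. With uncorrelated predictors the same errors would give an RME of 2.0.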

We repeat the simulation 100 times and compute the average of RMEs
and their standard errors. We also record how frequently each variable is
selected during the 100 simulations. For the variable selection of random
lasso, since the final estimator is the average over all bootstrap samples, it
is very easy for a variable to have a nonzero coefficient if it has a nonzero
coefficient in any particular bootstrap sample. So it is a little unfair to use
zero or nonzero as the variable selection criterion for random lasso. In this
paper we introduce a threshold $t_n$, and consider a variable $x_j$ to be
selected only if the corresponding coefficient satisfies $|\hat\beta_j| > t_n$.
In the following simulation studies, we chose $t_n = 1/n$, where n is the
sample size of the training data. We compare the performance in prediction
accuracy and variable selection frequency of random lasso with the following
methods: OLS, lasso, adaptive lasso, elastic-net and two other recently
developed methods: relaxed lasso [Meinshausen (2007)] and VISA [Radchenko
and James (2008)]. In Example 5, since n < p, the OLS estimator is not
unique, so we fitted ridge regression, and used the inverse of the absolute
value of the ridge regression estimator as the weight for adaptive lasso.
Results are summarized in Tables 1 and 2.
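The thresholding rule for random lasso amounts to a one-line filter. The sketch below uses made-up averaged coefficients, and the function name is hypothetical.

```python
def selected_variables(beta_hat, n):
    """Indices j with |beta_hat_j| > t_n, using the threshold t_n = 1/n."""
    t_n = 1.0 / n
    return [j for j, b in enumerate(beta_hat) if abs(b) > t_n]


# Hypothetical averaged coefficients from a random lasso fit with n = 50.
beta_hat = [1.84, 0.015, -0.83, 0.0, -0.03]
print(selected_variables(beta_hat, n=50))  # [0, 2, 4]
```

Here the threshold is 0.02, so the coefficient 0.015 counts as unselected even though it is nonzero.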

As we can see from Table 1, shrinkage methods perform much better
than OLS in most cases. This illustrates that some regularization is crucial
in achieving higher prediction accuracy. We also see that random lasso has
competitive RMEs with all other methods in Examples 1 and 2, except
perhaps when compared to elastic-net on Example 2. However, one should
keep in mind that Example 2 represents the motivating setup for the
elastic-net and, thus, this result is not surprising. Random lasso has
consistently smaller RMEs than all other regularization methods in
Examples 3-5. It also has the highest important variable selection frequency
(see Table 2). In fact, random lasso selects most of the important variables
all the time. It also has competitive performance in removing unimportant
variables compared to other methods in Examples 1, 3 and 4. In Example 5
random lasso selects more unimportant variables than other methods, but it
also selects almost all important variables while other methods perform
poorly on this aspect.

Table 2

                     Lasso           Adaptive lasso  Elastic-net     Relaxed lasso   VISA            Random lasso
Example 3
  n = 50    IV       …               …               …               (3, 29, 61)     (2, 28, 60)     (93, 98, 100)
            UV       (14, 20, 30)    (6, 11, 18)     (6, 13, 18)     (7, 9, 15)      (4, 7, 14)      (10, 17, 24)
  n = 100   IV       (45, 69, 95)    (68, 82, 93)    (51, 76, 99)    (39, 62, 88)    (38, 62, 88)    (98, 99, 99)
            UV       (43, 52, 55)    (15, 21, 31)    (29, 35, 40)    (27, 36, 42)    (27, 37, 43)    (22, 30, 37)
Example 4
  n = 50    IV       (11, 70, 77)    (16, 49, 59)    (63, 92, 96)    (4, 63, 70)     (4, 62, 73)     (84, 96, 97)
            UV       (12, 17, 25)    (4, 8, 14)      (9, 17, 23)     (0, 4, 9)       (1, 3, 8)       (11, 21, 30)
  n = 100   IV       (8, 84, 88)     (17, 62, 72)    (70, 98, 99)    (3, 75, 84)     (3, 76, 85)     (89, 99, 99)
            UV       (12, 22, 31)    (4, 10, 14)     (7, 14, 21)     (1, 3, 8)       (1, 4, 9)       (8, 14, 21)
Example 5
            IV       (19, 30, 40)    (15, 25, 35)    (40, 50, 61)    (14, 23, 34)    (16, 27, 35)    (76, 86, 95)
            UV       (3, 8, 14)      (0, 7, 11)      (1, 5, 8)       (0, 3, 8)       (0, 2, 8)       (18, 29, 38)

Notes: Since OLS always includes all variables, it is excluded from the
comparison. IV: important variables; UV: unimportant variables. The three
numbers in each pair of parentheses are the min, median, and max of
selection frequencies among all important or unimportant variables,
respectively.

It is interesting to compare the elastic net and the random lasso in terms 
of the signs of the estimated nonzero coefficients of the important variables in 
Examples 3 and 4. In these two examples, the important variables are highly 
correlated but with different signs. The result is summarized in Tables 4 and 
5. We can see that random lasso has much better performance in estimating 
correct signs for truly negative coefficients, and much smaller estimation 
bias than the elastic net method. 

For random lasso, the selection of $q_1$ and $q_2$ can be crucial. For
Examples 1 and 2, we select the optimal $q_1$ and $q_2$ based on the
validation data set among values 2, 4, 6 and 8; for Examples 3 and 4, we
select the optimal $q_1$ and $q_2$ among values 4, 8, 12, 16, 20, 24 and 28;
and for Example 5, we select the optimal $q_1$ and $q_2$ among values 5, 10,
15, 20 and 25. We also summarize the frequency for the selected $q_1$ and
$q_2$ in Examples 1 and 2 with sample size n = 50 (see Table 3). From the
last columns and the last rows of the six sub-tables, we can see that random
lasso prefers a smaller number of predictors in both the first stage and the
second stage of the algorithm, as $\sigma$ becomes larger (correspondingly,
the signal-to-noise ratio is smaller). This illustrates that the random
subset selection of variables in each bootstrap sample can be helpful when
the signal-to-noise ratio is small.

It should be noted that we also experimented with "Bagged Lasso" (that
is, a 1-step bootstrap approach with q = p) on all simulations. The results
were reasonable and, in fact, very similar to the elastic net results on all
setups. However, since these results are clearly inferior overall to those of
the random lasso, we chose to omit them to avoid clutter.

4. Glioblastoma gene expression data analysis. We analyze the data 
from a glioblastoma microarray gene expression study conducted by Hor- 
vath et al. (2006) by using the proposed random lasso method and compare 
with the lasso, adaptive lasso, relaxed lasso, elastic-net and VISA methods. 

Glioblastoma is the most common primary malignant brain tumor of 
adults and one of the most lethal of all cancers. Patients with this dis- 
ease have a median survival of 15 months from the time of diagnosis despite 
surgery, radiation, and chemotherapy. Global gene expression data from two 
independent sets of clinical tumor samples of n = 55 and n = 65 are ob- 
tained by high-density Affymetrix arrays. Expression values of 3600 genes 
are available. Among the first set of 55 patients, five were alive at the last
follow-up and four were alive in the second set. In our analysis, we exclude
these nine censored subjects, and use the logarithm of time to death as the
response variable. The first data set is used as the training set and the second
data set as the test set.



Table 4

Coefficient and coefficient sign estimation of elastic net and random lasso for Example 3

                          β1      β2      β3      β4      β5      β6      β7      β8      β9      β10
True coef.                3       3       3       3       3       -2      -2      -2      -2      -2
Enet (n = 50)
  Ave. of est.            1.03    1.06    0.91    1.04    0.98    -0.05   -0.03   -0.03   0.01    0.04
                          (0.07)  (0.07)  (0.06)  (0.08)  (0.07)  (0.06)  (0.04)  (0.05)  (0.03)  (0.02)
  Freq. (%) of pos. sgn.  94      91      92      95      91      23      16      17      19      27
  Freq. (%) of neg. sgn.  …
RLasso (n = 50)
  Ave. of est.            1.84    2.01    1.75    1.81    1.84    -0.84   -0.89   -0.88   -0.91   -0.83
                          (0.12)  (0.12)  (0.12)  (0.11)  (0.11)  (0.09)  (0.07)  (0.07)  (0.07)  (0.07)
  Freq. (%) of pos. sgn.  98      99      96      98      … 4 …
Enet (n = 100)
  Ave. of est.            … 1.42 1.54 1.47 1.43 … -0.52 -0.47 -0.38 -0.52 …
                          …                       (0.09)  (0.11)  (0.09)  (0.09)  (0.09)  (0.07)  (0.09)
  Freq. (%) of pos. sgn.  98      99      98      97      98      15      20      17      17      19
  Freq. (%) of neg. sgn.  … 34 … 34 …
RLasso (n = 100)
  Ave. of est.            … 2.51 2.45 2.31 2.48 -1.51 -1.35 … -1.41 …
                          … (0.09) (0.09) (0.08) (0.09) (0.07) (0.06) …
  Freq. (%) of pos. sgn.  99      … 99 98 99 …

Table 5

Coefficient and coefficient sign estimation of elastic net and random lasso for Example 4

                      β1      β2      β3      β4      β5      β6
True coef.            3       3       -2      3       3       -2
Enet (n = 50)
  Ave. of est.        1.30    1.44    0.51    1.75    1.47    0.74
                      (0.07)  (0.08)  (0.06)  (0.09)  (0.07)  (0.07)
  No. of pos. sgn.    92      94      63      96      92      70
  No. of neg. sgn.                                            1
RLasso (n = 50)
  Ave. of est.        1.85    1.68    -0.17   2.01    1.89    -0.17
                      (0.12)  (0.13)  (0.07)  (0.14)  (0.13)  (0.09)
  No. of pos. sgn.    98      90      33      91      96      38
  No. of neg. sgn.    1       7       65      5       2       57
Enet (n = 100)
  Ave. of est.        1.57    1.57    0.54    1.69    1.67    0.61
                      (0.06)  (0.07)  (0.05)  (0.06)  (0.06)  (0.05)
  No. of pos. sgn.    97      98      69      98      99      72
  No. of neg. sgn.
RLasso (n = 100)
  Ave. of est.        2.25    1.91    -0.57   2.28    2.08    -0.55
                      (0.06)  (0.07)  (0.05)  (0.06)  (0.06)  (0.05)
  No. of pos. sgn.    99      97      17      100     99      15
  No. of neg. sgn.            2       81                      83


We first assess each of the 3600 genes by running simple linear regression
on the training set, and then select 1000 genes with the smallest p-values.
Starting with these 1000 genes, we fit a linear regression model by the
proposed random lasso method on the training set, and select 58 genes. Table 7
lists the gene symbol and estimated coefficient for each of these 58 genes. The
model with these selected 58 genes is then used to predict the log-survival
times for subjects in the test set. We also analyze the data using other
lasso-related methods starting with the same 1000 genes on the training set
and evaluate obtained models using the test set.

Table 6

Analysis of the glioblastoma data set

Method            No. of genes selected   Mean prediction error
Lasso             29                      1.118 (0.205)
Adaptive lasso    33                      1.143 (0.211)
Relaxed lasso     23                      1.054 (0.194)
Elastic-net       28                      1.113 (0.204)
VISA              15                      0.997 (0.188)
Random lasso      58                      0.950 (0.210)

Table 7

Gene symbol and estimated coefficient for each of the 58 genes selected by random lasso
based on 50 subjects in the training set

Gene symbol   Estimated coefficients   Gene symbol   Estimated coefficients
VSNL1         -3.839                   KIAA0194      -0.039
SNAP25        -1.561                   MOBP          -0.033
UBE2D3        -0.382                   PTGDS         -0.028
ARF4          -0.341                   KIF5A         -0.024
CSNK1A1       -0.319                   GORASP2       -0.021
C13orf11      -0.312                   ME2           -0.019
CHGA          -0.310                   CGI-141       -0.019
C11orf24      -0.223                   p25           -0.017
OPTN          -0.221                   UGT8          -0.016
UNC84B        -0.176                   CKMT1         -0.014
S100A1        -0.157                   KIF1A         -0.013
KCNS1         -0.155                   KCNAB2        -0.012
INF Y         -0.124                   C3orf4        -0.011
TIP-1         -0.107                   DNASE1L1      -0.011
FAIM2         -0.086                   RNF44         0.011
FSTL3         -0.074                   ATP6V1B2      0.012
NEFH          -0.072                   POLR3E        0.012
CTSK          -0.071                   LIN7C         0.014
RGS3          -0.071                   GBP2          0.015
PGCP          -0.070                   CSF1R         0.018
FLJ20254      -0.059                   JIK           0.019
ANXA2         -0.053                                 0.019
FLJ11155      -0.052                   CIS           0.026
P2RX4         -0.049                   ARHGAP15      0.040
GPNMB         -0.044                   PPM1H         0.063
ICAM5         -0.043                   MARK4         0.071
ADIPOR1       -0.043                   HPCAL4        0.196
BSCL2         -0.042                   SULT4A1       0.785
AMBP          -0.042                   BSN           2.662
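The univariate screening step that reduces the 3600 genes to 1000 can be sketched as follows. In simple linear regression with a fixed sample size, the p-value of the slope is a decreasing function of the absolute sample correlation between predictor and response, so ranking genes by smallest p-value is the same as ranking them by largest absolute correlation. The function names and the toy data are hypothetical.

```python
import math


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def screen_top_k(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y."""
    p = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: 5 subjects, 3 hypothetical genes.
y = [1, 2, 3, 4, 5]
X = [[1, 2, 1], [2, 1, -1], [3, 4, 1], [4, 3, -1], [5, 5, 1]]
print(screen_top_k(X, y, k=2))  # [0, 1]
```

In the analysis described above, `k` would be 1000 and the screening would be run on the training set only.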

Table 6 shows the number of genes selected by each of these six methods 
in the training set and corresponding mean prediction error in the test set. 
We can see that random lasso has the smallest prediction error. It also selects 
more genes than the other five methods. Among the 58 genes selected by 
random lasso, 7 genes are also selected by lasso, adaptive lasso, relaxed 
lasso, VISA and elastic-net (for adaptive lasso, the adaptive weights were 
calculated using ridge regression). 

Several genes identified by the proposed method are of interest. VSNL1,
RGS3 and S100A4 are identified to be negatively associated with the
patients' survival. VSNL1 is a member of the visinin/recoverin subfamily of
neuronal calcium sensor proteins. The encoded protein is strongly expressed
in granule cells of the cerebellum where it associates with membranes in a 
calcium-dependent manner and modulates intracellular signaling pathways 
of the central nervous system by directly or indirectly regulating the activity 
of adenylyl cyclase. A previous study [Xie et al. (2007)] has demonstrated 
that VSNL1 plays a very important role in neuroblastoma metastasis, and 
VSNL1 mRNA in highly invasive cells is significantly higher than that in 
lowly invasive cells. RGS3 encodes a member of the regulator of the G- 
protein signaling (RGS) family. This protein is a GTP-ase activating pro- 
tein which inhibits G-protein mediated signal transduction. Tatenhorst et 
al. (2004) demonstrated that glioma cell clones overexpressing RGS3 showed 
an increase of both adhesion and migration. S100A4 encodes a member of 
the S100 family of proteins, which are localized in the cytoplasm and/or 
nucleus of a wide range of cells, and involved in the regulation of a number 
of cellular processes such as cell cycle progression and differentiation. It is 
conjectured that the protein encoded by S100A4 may function in motility, 
invasion, and metastasis [Zou et al. (2005)]. VSNL1, RGS3 and S100A4 were 
also identified to be related to the poor survival of brain tumor patients in 
Freije et al. (2004). BSN is identified to be positively associated with the 
patients' survival. This gene is expressed primarily in neurons in the brain, 
and the protein encoded by this gene is thought to be a scaffolding protein 
involved in organizing the presynaptic cytoskeleton. Additional studies will 
be required to establish the direct relationships between the expression of 
these genes and the Glioblastoma tumor behavior. 

It is also interesting to observe that estimated coefficients of VSNL1 and
BSN (-3.839 and 2.662, respectively) have different signs, but the
correlation between the expression levels for these two genes in the training
set is very high ($\rho = 0.96$). Neither lasso nor elastic-net picked up
these two genes. It is worth conducting more detailed experiments to further
explore the connection between VSNL1 and BSN, and their relations to the
Glioblastoma tumor behavior.

5. Conclusion. We have proposed the random lasso method for variable
selection. The idea of random lasso is to mimic the random forest method
[Breiman (2001)] for linear regression models. By drawing bootstrap samples
from the original training set and randomly selecting candidate variables,
the average of the predictive models based on multiple bootstrap samples
alleviates two possible limitations of lasso. It tends to select or remove
highly correlated variables more efficiently and has more flexibility in
estimating their coefficients than the elastic-net method. The number of
variables selected by random lasso is not limited by the sample size.
Simulation studies






WANG ET AL. 



show that the proposed random lasso method has good prediction
performance compared to a large set of competitor approaches, and the
analysis of the Glioblastoma microarray data set demonstrates the usefulness
of the proposed method in practice.